Pedestrian Foot Traffic Business Case
Authored by: Ngoc Dung Hyunh
Duration: 40 mins
Level: Advanced
Pre-requisite skills: Python
Scenario
What this Use Case will teach you
At the end of this use case you will:
A brief introduction to the datasets used
The City of Melbourne provides comprehensive pedestrian traffic data, as well as sensor location data. In this use case we will utilise the Pedestrian Counting System - Past Hour (counts per minute) dataset.
This dataset contains minute-by-minute directional pedestrian counts from the last hour, captured by pedestrian sensor devices located across the city. The data is updated every 15 minutes and can be used to determine variations in pedestrian activity throughout the day.
We also utilise the City of Melbourne's Sensor Locations dataset, from which we extract the location of each sensor to support visualisation.
Accessing and Loading data
We aim to make a decision that will be informed by insights based on the latest data.
In order to get this data, we can create a function to extract, transform and load (ETL) pedestrian traffic data every 15 minutes.
First, we will do ETL for Sensor Locations and Pedestrian Counting System - Past Hour (counts per minute).
We will then merge these datasets.
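The 15-minute refresh cadence can be sketched as a generic polling loop. This is a minimal sketch: `refresh_every` and `etl_step` are hypothetical names, with `etl_step` standing in for the extract-transform-load functions we build below.

```python
import time

def refresh_every(interval_seconds, etl_step, max_cycles=None):
    """Call an ETL function repeatedly at a fixed interval.

    interval_seconds: pause between refreshes (900 for 15 minutes).
    etl_step: any zero-argument callable returning fresh data.
    max_cycles: stop after this many refreshes (None = run forever).
    """
    snapshots = []
    cycle = 0
    while max_cycles is None or cycle < max_cycles:
        snapshots.append(etl_step())
        cycle += 1
        if max_cycles is None or cycle < max_cycles:
            time.sleep(interval_seconds)
    return snapshots

# Demo with a stand-in ETL step so no network access is needed:
snapshots = refresh_every(0, lambda: {"rows": 42}, max_cycles=3)
print(len(snapshots))  # 3
```

With `interval_seconds=900` and the real ETL functions, this loop would re-pull the dataset every 15 minutes, matching the update frequency of the source.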
Note: We use the package sodapy to extract data from Melbourne Open Data directly. This package is a Python client for the Socrata Open Data API. To extract a dataset from Melbourne Open Data, you must know its dataset id, which is used as follows:

from sodapy import Socrata
import pandas as pd
import time
import warnings
warnings.filterwarnings('ignore')
from IPython.display import display, HTML

# Function to get Sensor Location data
def sensor_location():
    client = Socrata('data.melbourne.vic.gov.au', None)
    sensor_location_data_id = "h57g-5234"
    results = client.get(sensor_location_data_id)
    df = pd.DataFrame.from_records(results)
    sensor_location = df[["sensor_id", "sensor_description", "latitude", "longitude"]]
    sensor_location.columns = ["Sensor ID", "Sensor Description", "lat", "lon"]
    sensor_location["lat"] = sensor_location["lat"].astype(float)
    sensor_location["lon"] = sensor_location["lon"].astype(float)
    return sensor_location
# Function to get the City of Melbourne's Pedestrian Counting System - Past Hour (counts per minute)
def fifty_minute_count(location):
    # Extract data
    client = Socrata('data.melbourne.vic.gov.au', None)
    fifty_minute_count_data_id = "d6mv-s43h"
    results = client.get(fifty_minute_count_data_id)
    df = pd.DataFrame.from_records(results)
    df = df[["date_time", "sensor_id", "total_of_directions"]]
    df.columns = ["DateTime", "Sensor ID", "Count"]
    # Show the latest update
    weekdays = {0: "Monday", 1: "Tuesday", 2: "Wednesday", 3: "Thursday",
                4: "Friday", 5: "Saturday", 6: "Sunday"}
    times = df["DateTime"]
    # Sorting descending puts the most recent timestamp first
    last_time = times.sort_values(ascending=False).iloc[0].split("T")
    weekday = time.strptime(last_time[0], "%Y-%m-%d").tm_wday
    print(f"The latest data was updated on {weekdays[weekday]} {last_time[0]} at {last_time[1].split('.')[0]} at https://data.melbourne.vic.gov.au")
    # Transform data
    df["Count"] = pd.to_numeric(df["Count"])
    counting = df[["Sensor ID", "Count"]].groupby(["Sensor ID"]).sum()
    counting.reset_index(level=0, inplace=True)
    location["Sensor ID"] = location["Sensor ID"].astype(str)
    # Merge the two datasets
    counting = pd.merge(counting, location, on='Sensor ID', how='inner')
    counting["Count"] = counting["Count"].astype(float)
    return counting
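As an aside, the weekday lookup dictionary above can be replaced by `time.strftime("%A")`, which formats the weekday name directly. A minimal sketch with a fixed example date (note the exact name can vary with the system locale):

```python
import time

# strftime("%A") yields the full weekday name, so no manual
# {0: "Monday", ...} mapping is required.
parsed = time.strptime("2021-10-04", "%Y-%m-%d")
weekday_name = time.strftime("%A", parsed)
print(weekday_name)  # Monday
```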
#ETL Sensor Location
location_data = sensor_location()
# Pedestrian Counting System - Past Hour (counts per minute) and merge with Sensor Location
counting_data = fifty_minute_count(location_data)
# Show the first 5 rows
counting_data.head(5)
| Column | Description | Type |
|---|---|---|
| Sensor ID | Unique sensor identifier | Categorical |
| Count | Sum of pedestrian counts over the past hour | Numerical |
| Sensor Description | A description of where the sensor is located | Categorical |
| lat | Latitude of each sensor | Numerical |
| lon | Longitude of each sensor | Numerical |
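The inner merge on `Sensor ID` used above can be illustrated with toy data (the sensor IDs, names, and coordinates below are made up for the example):

```python
import pandas as pd

# Toy counts for three sensors; sensor "9" has no known location.
counts = pd.DataFrame({"Sensor ID": ["1", "2", "9"],
                       "Count": [120, 45, 7]})
# Toy locations for three sensors; sensor "3" recorded no counts.
locs = pd.DataFrame({"Sensor ID": ["1", "2", "3"],
                     "Sensor Description": ["Bourke St", "Swanston St", "Collins St"],
                     "lat": [-37.81, -37.82, -37.83],
                     "lon": [144.96, 144.97, 144.96]})

# An inner join keeps only sensors present in BOTH frames ("1" and "2"),
# which is why unmatched sensors silently drop out of counting_data.
merged = pd.merge(counts, locs, on="Sensor ID", how="inner")
print(sorted(merged["Sensor ID"]))  # ['1', '2']
```

An outer or left join would instead keep the unmatched rows with NaNs, which can be useful for auditing which sensors lack location records.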
Using Selenium to Crawl Data
To help inform our business scenario, we can compare the data from two different sensor locations.
This means we can compare pedestrian foot traffic at two separate locations over different periods of the current date, as well as over a 4-week period.
However, in order to get this data, we need to crawl it from http://www.pedestrian.melbourne.vic.gov.au.
Note: Because we cannot use sodapy to extract data from this website, we use another package, Selenium, to crawl it.

# Set up Selenium
from selenium import webdriver
from time import sleep
from datetime import date
import numpy as np
import csv
from csv import reader
import matplotlib.pyplot as plt
import os.path

# Download a chromedriver binary if one is not already present
if not os.path.isfile("chromedriver"):
    !wget "https://chromedriver.storage.googleapis.com/94.0.4606.41/chromedriver_linux64.zip"
    !unzip chromedriver_linux64.zip

chromeOptions = webdriver.ChromeOptions()
prefs = {"download.default_directory": "/home/parker/Melbourne_Pedestrian_Data/data/"}
chromeOptions.add_experimental_option("prefs", prefs)
def daily_monthly_data():
    # Crawl data
    today = date.today().strftime("%d-%m-%Y")
    file_name = f"COM_24PM_{today}.csv"
    if os.path.exists("data/" + file_name):
        os.remove("data/" + file_name)
    driver = webdriver.Chrome(executable_path='./chromedriver', options=chromeOptions)
    driver.get("http://www.pedestrian.melbourne.vic.gov.au/")
    driver.find_element_by_xpath('//*[@id="dropdown-size-small"]').click()
    sleep(2)
    driver.find_element_by_xpath('/html/body/div/div/div[5]/div[2]/div/ul/li/a').click()
    sleep(5)
    driver.quit()
    # Clean data: rows 9-86 of the CSV hold daily counts, rows 91-168 monthly counts
    with open("data/" + file_name, 'r') as read_obj:
        csv_reader = reader(read_obj)
        daily_data = {}
        monthly_data = {}
        for i, row in enumerate(csv_reader):
            if i in range(9, 87):
                daily_data[row[0]] = row[1:]
            elif i in range(91, 169):
                monthly_data[row[0]] = row[1:]
    # Sensors with no readings get a full day of NaNs
    for sensor in daily_data:
        if len(daily_data[sensor]) == 0:
            daily_data[sensor] = [np.nan] * 24
    for sensor in monthly_data:
        if len(monthly_data[sensor]) == 0:
            monthly_data[sensor] = [np.nan] * 24
    # Return daily data and monthly data of sensors
    daily_data = pd.DataFrame(daily_data)
    monthly_data = pd.DataFrame(monthly_data)
    daily_data = daily_data.replace("N/A", np.nan)
    monthly_data = monthly_data.replace("N/A", np.nan)
    for column in daily_data.columns:
        daily_data[column] = pd.to_numeric(daily_data[column])
        monthly_data[column] = pd.to_numeric(monthly_data[column])
    daily_data = daily_data.replace(0, np.nan)
    os.remove("data/" + file_name)
    return daily_data, monthly_data
daily_data, monthly_data = daily_monthly_data()
# Rename columns of daily and monthly data because some sensors have different names in different datasets.
columns = {"Collins St (North)": "Collins Street (North)",
"Lincoln-Swanston (W)":"Lincoln-Swanston (West)",
"RMIT Bld 80 - 445 Swanston Street": "Building 80 RMIT",
"Flinders Ln - Degraves St (South)": "Flinders Ln -Degraves St (South)",
"Flinders Ln - Degraves St (North)": "Flinders Ln -Degraves St (North)",
"Flinders Ln - Degraves St (Crossing)": "Flinders Ln -Degraves St (Crossing)",
"Flinders St - ACMI": "Flinders St- ACMI",
"Spring St - Flinders St (West)": "Spring St- Flinders st (West)",
"Macaulay Rd - Bellair St": "Macaulay Rd- Bellair St",
"Harbour Esplanade - Pedestrian Path": "Harbour Esplanade (West) - Pedestrian path",
"Harbour Esplanade - Bike Path": "Harbour Esplanade (West) - Bike Path",
"Flinders La-Swanston St (West) Temporary": "Flinders La - Swanston St (West) Temporary",
"380 Elizabeth St": "Flinders St-Swanston St (West)"
}
daily_data.rename(columns=columns, inplace=True)
monthly_data.rename(columns=columns, inplace=True)
# Show the first 5 rows of daily_data (updated hourly)
daily_data.head()
# Show the first 5 rows of monthly_data (4-week hourly averages)
monthly_data.head()
Basic Map Visualisation
Visualising data provides an easy way for us to detect patterns within the data. Using the python library folium, we can create a live pedestrian traffic map of the City of Melbourne.
This map will represent all of the data of the pedestrian sensor locations within the City of Melbourne.
This map is updated every 15 minutes for up-to-date pedestrian traffic information.
import folium
from folium.plugins import MarkerCluster

def map_visualization(data):
    # Repeat each sensor's coordinates once per counted pedestrian,
    # so cluster sizes reflect traffic volume
    locations = []
    for i in range(len(data)):
        row = data.iloc[i]
        locations += [(row.lat, row.lon)] * int(row.Count)
    marker_cluster = MarkerCluster(
        locations=locations,
        overlay=True,
        control=True,
    )
    m = folium.Map(location=[-37.8167, 144.967], tiles="Cartodb Positron", zoom_start=15)
    marker_cluster.add_to(m)
    folium.LayerControl().add_to(m)
    return m
map_visualization(counting_data)
As we can see, the majority of the pedestrian traffic takes place around Swanston Street in the heart of the city.
Basic Trend Visualisation
Having developed a basic visualisation of our data, we can now produce a visualisation that shows us potential trends in pedestrian foot traffic.
We can create a line chart that consists of two lines of different sensor locations at different times of day.
Based on this visualisation, we can compare the rush hour and off-peak periods of the two sensors.
This information has the potential to help inform our business owner's choice of a new location.
# Function to plot a line chart comparing two sensors
def line_chart(daily_data_sensor_1, monthly_data_sensor_1, daily_data_sensor_2, monthly_data_sensor_2, sensor_1, sensor_2):
    hours = [f"{i:02d}:00" for i in range(24)]
    plt.figure(figsize=(20, 10))
    plt.plot(hours, daily_data_sensor_1, label=f'Hourly count ({sensor_1})', linewidth=2, c="r")
    plt.plot(hours, monthly_data_sensor_1, '--', label=f'4 week average ({sensor_1})', linewidth=2, c="r")
    plt.plot(hours, daily_data_sensor_2, label=f'Hourly count ({sensor_2})', linewidth=2, c="b")
    plt.plot(hours, monthly_data_sensor_2, "--", label=f'4 week average ({sensor_2})', linewidth=2, c="b")
    labels = [f"Hourly count ({sensor_1})", f'4 week average ({sensor_1})', f"Hourly count ({sensor_2})", f'4 week average ({sensor_2})']
    plt.legend(labels, prop={'size': 20})
    plt.grid(color='g', linestyle='-', linewidth=0.1)
    plt.xticks(hours)
    plt.title(f"Hourly Pedestrian Count at {sensor_1} and {sensor_2}", fontsize=20)
    plt.xlabel("Hour", fontsize=20)
    plt.ylabel("Pedestrians", fontsize=20)
# Function to filter the datasets by Sensor ID and plot the comparison
def trend_visualization(sensor_1, sensor_2, daily_data=daily_data, monthly_data=monthly_data):
    sensor_1 = location_data[location_data["Sensor ID"] == str(sensor_1)]["Sensor Description"].iloc[0]
    sensor_2 = location_data[location_data["Sensor ID"] == str(sensor_2)]["Sensor Description"].iloc[0]
    daily_data_sensor_1 = daily_data[sensor_1]
    monthly_data_sensor_1 = monthly_data[sensor_1]
    daily_data_sensor_2 = daily_data[sensor_2]
    monthly_data_sensor_2 = monthly_data[sensor_2]
    line_chart(daily_data_sensor_1, monthly_data_sensor_1, daily_data_sensor_2, monthly_data_sensor_2, sensor_1, sensor_2)

# Show the line chart for sensors 1 and 3
trend_visualization(1, 3)
Based on this visualisation, if we want to open a business that caters for the lunchtime crowd then we may wish to locate our business near Bourke Street Mall.
However, if we want to open a business that caters to the dinner crowd, perhaps the area in and around Melbourne Central provides better incentives.
This goes to show just how dynamic life is in the City of Melbourne.
Geographic Filter
Sometimes we require more specific information that may not be obtainable from the sensor information alone. In this example, we take specific addresses as input instead of the sensor locations to provide a more meaningful and insightful result.
We can create a Geographic Filter that takes a specific address, such as 100 Flinders Street, and creates a live map and a live line chart.
The problem is that an address may have multiple pedestrian sensors near it. Therefore, we introduce another variable called Radius, which filters for all sensors that fall within the given radius of the target address.
We then sum the counts of these sensors to estimate the pedestrian traffic around the address.
To produce this filter, we have 8 steps:
import googlemaps
import numpy as np
import geopy.distance
from folium.plugins import MarkerCluster

# Replace with your own Google Maps API key; never commit a real key.
API_key = 'YOUR_GOOGLE_MAPS_API_KEY'
gmaps = googlemaps.Client(key=API_key)

# This function takes an address and returns the full address name and its latitude and longitude
def extract_location(filter_address_value):
    address_name = f'{filter_address_value}, Melbourne, Victoria, Australia.'
    geocode_result = gmaps.geocode(address_name)
    location = geocode_result[0]["geometry"]["location"]
    return address_name, np.array([location["lat"], location["lng"]])

# This function takes the whole dataset, an address, and a radius_value (metres),
# and returns the data of the sensors that lie within the radius of the address
def geo_filter_data(dataset, filter_address_value, radius_value):
    specific_address = extract_location(filter_address_value)
    counting_data_filter = dataset
    counting_data_filter["distance"] = counting_data_filter[["lat", "lon"]].apply(lambda row: distance(row, specific_address[1]), axis=1)
    counting_data_filter = counting_data_filter[counting_data_filter["distance"] <= radius_value]
    location = [specific_address[1][0], specific_address[1][1]]
    return counting_data_filter, location
def map_visualization(data, filter_sensors, center, filter_values, radius_value):
    # Without filter addresses, cluster every sensor reading on one map
    if filter_values == []:
        locations = []
        for i in range(len(data)):
            row = data.iloc[i]
            locations += [(row.lat, row.lon)] * int(row.Count)
        marker_cluster = MarkerCluster(
            locations=locations,
            overlay=True,
            control=True,
        )
        m = folium.Map(location=[-37.8167, 144.967], tiles="Cartodb Positron", zoom_start=15)
        marker_cluster.add_to(m)
        folium.LayerControl().add_to(m)
        return m
    # With filter addresses, mark each address and draw its radius
    else:
        m = folium.Map(location=list(center), tiles="Cartodb Positron", zoom_start=15)
        locations = []
        for location in filter_sensors:
            sensor_ids = filter_sensors[location]["sensor id"]
            center_i = filter_sensors[location]["location"]
            data_filter = data[data["Sensor ID"].isin(sensor_ids)]
            for i in range(len(data_filter)):
                row = data_filter.iloc[i]
                locations += [(row.lat, row.lon)] * int(row.Count)
            label = folium.Popup(location, parse_html=True)
            folium.Marker(location=list(center_i), popup=label).add_to(m)
            folium.Circle(
                location=list(center_i),
                radius=radius_value,
                color='red',
                fill=True,
                fill_color='#ffffff00',
                fill_opacity=0.7,
                parse_html=True).add_to(m)
        marker_cluster = MarkerCluster(
            locations=locations,
            overlay=True,
            control=True,
        ).add_to(m)
        folium.LayerControl().add_to(m)
        return m
# This function calculates the distance in metres between the address and a sensor
def distance(row, address):
    return geopy.distance.geodesic((row[0], row[1]), (address[0], address[1])).m
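geopy's `geodesic` handles this distance calculation for us. As a dependency-free sketch of the same idea, the haversine formula approximates the distance on a spherical Earth (the coordinates and helper names below are illustrative, not part of the tutorial's pipeline):

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres on a spherical Earth (radius 6371 km)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * 6371000.0 * math.asin(math.sqrt(a))

# Keep only (lat, lon) points within radius_m of a target location,
# mirroring what geo_filter_data does with the sensor DataFrame.
def within_radius(points, target, radius_m):
    return [p for p in points if haversine_m(p[0], p[1], target[0], target[1]) <= radius_m]

target = (-37.8183, 144.9671)    # illustrative point near Flinders Street Station
points = [(-37.8180, 144.9670),  # tens of metres away -> kept
          (-37.8300, 144.9500)]  # roughly two kilometres away -> dropped
print(len(within_radius(points, target, 200)))  # 1
```

geopy's geodesic distance models the Earth as an ellipsoid and is more accurate, which is why the tutorial uses it; the haversine approximation typically agrees to within about 0.5%.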
# This function plots the line chart
def line_chart(daily_data, monthly_data, locations, filter_address_value):
    hours = [f"{i:02d}:00" for i in range(24)]
    plt.figure(figsize=(20, 10))
    # If there is no geo filter value, plot the top 3 pedestrian sensor counts
    if filter_address_value == []:
        labels = []
        title = "Top 3 Hourly Pedestrian Count Sensors in Melbourne, Victoria, Australia"
        colors = ["r", "y", "b"]
        for i, sensor_id in enumerate(locations):
            name = location_data[location_data["Sensor ID"] == sensor_id]["Sensor Description"].iloc[0]
            daily_data_i = list(daily_data[name])
            monthly_data_i = list(monthly_data[name])
            plt.plot(hours, daily_data_i, label=f'Hourly count at {name} today', linewidth=2, c=colors[i])
            plt.plot(hours, monthly_data_i, "--", label=f'One month average at {name}', linewidth=2, c=colors[i])
            labels.append(f'Hourly count at {name} today')
            labels.append(f'One month average at {name}')
        plt.legend(labels, prop={'size': 13})
        plt.grid(color='g', linestyle='-', linewidth=0.1)
        plt.xticks(hours)
        plt.title(title, fontsize=20)
        plt.xlabel("Hour", fontsize=20)
        plt.ylabel("Pedestrians", fontsize=20)
        plt.show()
    # If there are geo filter values, plot the sum of all sensors around each address
    else:
        labels = []
        title = "Pedestrian Data Analysis at Selected Addresses"
        colors = ["r", "y", "b", "black", "purple"]
        # For each address of the input addresses
        for i, location in enumerate(locations):
            sensor_id = locations[location]["sensor id"]
            sensor_names = []
            # Get the name of every sensor near this address
            for sensorID in sensor_id:
                name = location_data[location_data["Sensor ID"] == sensorID]["Sensor Description"].iloc[0]
                sensor_names.append(name)
            # Filter daily and monthly data
            filter_daily_data = daily_data[sensor_names]
            filter_monthly_data = monthly_data[sensor_names]
            # Get the sum of all sensors around an address
            filter_daily_data["SUM"] = filter_daily_data.sum(axis=1).replace(0, np.nan)
            filter_monthly_data["SUM"] = filter_monthly_data.sum(axis=1)
            # Add a daily line and a monthly line for each address
            plt.plot(hours, filter_daily_data["SUM"], label=f'Hourly count at {location} today', linewidth=2, c=colors[i])
            plt.plot(hours, filter_monthly_data["SUM"], "--", label=f'One month average at {location}', linewidth=2, c=colors[i])
            labels.append(f'Hourly count at {location} today')
            labels.append(f'One month average at {location}')
        plt.legend(labels, prop={'size': 13})
        plt.grid(color='g', linestyle='-', linewidth=0.1)
        plt.xticks(hours)
        plt.title(title, fontsize=20)
        plt.xlabel("Hour", fontsize=20)
        plt.ylabel("Pedestrians", fontsize=20)
        plt.show()
# This is the main visualization routine
def visualization(filter_address_values, live_counting_data=counting_data, radius_value=200):
    if filter_address_values == []:
        center = [-37.8167, 144.967]
        filter_sensors = None
        m = map_visualization(counting_data, filter_sensors, center, filter_address_values, radius_value)
        top_3_sensor_IDs = list(live_counting_data.sort_values(by=['Count'], ascending=False)["Sensor ID"])[:3]
        line_chart(daily_data, monthly_data, top_3_sensor_IDs, filter_address_values)
        return m
    else:
        centers = []
        locations = {}
        for address in filter_address_values:
            data_filter, lat_long = geo_filter_data(live_counting_data, address, radius_value)
            locations[address] = {}
            locations[address]["sensor id"] = list(data_filter["Sensor ID"])
            locations[address]["location"] = lat_long
            centers.append(lat_long)
        center = np.mean(centers, axis=0)
        m = map_visualization(live_counting_data, locations, center, filter_address_values, radius_value)
        line_chart(daily_data, monthly_data, locations, filter_address_values)
        return m
We can visualise pedestrian foot traffic without an address input. The function will return a live pedestrian traffic map of Melbourne and a line chart showing the three busiest pedestrian sensors.
filter_address_values = []
visualization(filter_address_values, live_counting_data=counting_data, radius_value=200)
We now analyse pedestrian traffic at 3 locations: Flinders Street Station, Southern Cross Station and Melbourne Central Station.
filter_address_values = ["Flinders Street Station", "Southern Cross Station", "Melbourne Central Station"]
visualization(filter_address_values, live_counting_data=counting_data, radius_value=300)
We can even add more addresses to analyse pedestrian traffic at 4 locations: 100 Little Collins Street, 50 Collins Street, 10 Elizabeth Street, and 10 Swanston Street.
filter_address_values = ["100 Little Collins Street", "50 Collins Street", "10 Elizabeth Street", "10 Swanston Street"]
visualization(filter_address_values, live_counting_data=counting_data, radius_value=200)
Congratulations!
You've successfully used Melbourne Open Data to visualise pedestrian traffic in and around the City of Melbourne!
!jupyter nbconvert --to html Pedestrian_Traffic_Analysis_Final.ipynb